The Impact of Text Preprocessing and Term Weighting on Arabic Text Classification
نویسندگان
چکیده
This research presents and compares the impact of text preprocessing, which has not been addressed before, on Arabic text classification using popular text classification algorithms; Decision Tree, K Nearest Neighbors, Support Vector Machines, Naïve Bayes and its variations. Text preprocessing includes applying different term weighting schemes, and Arabic morphological analysis (stemming and light stemming). We implemented and integrated Arabic morphological analysis tools within the leading open source machine learning tools: Weka, and RapidMiner. Text Classification algorithms are applied on seven Arabic corpora (3 in-house collected and 4 existing corpora). Experimental results show: (1) Light stemming with term pruning is best feature reduction technique. (2) Support Vector Machines and Naïve Bayes variations outperform other algorithms. (3) Weighting schemes impact the performance of distance based classifier.
منابع مشابه
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملDocument Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
متن کاملOptimising Sentiment Classification using Preprocessing Techniques
Sentiment Classification refers to the computational techniques for classifying whether the sentiments of text are positive or negative. Sentiment Classification being a specialized domain of text mining is expected to benefit after preprocessing. In this paper we propose various models with selective combinations of preprocessing techniques and Sentiment Classifiers, to optimize Sentiment Clas...
متن کاملPre Processing Techniques for Arabic Documents Clustering
Clustering of text documents is an important technique for documents retrieval. It aims to organize documents into meaningful groups or clusters. Preprocessing text plays a main role in enhancing clustering process of Arabic documents. This research examines and compares text preprocessing techniques in Arabic document clustering. It also studies effectiveness of text preprocessing techniques: ...
متن کاملArabic Text Classification Using Decision Trees
1 Text mining draw more and more attention recently, it has been applied on different domains including web mining, opinion mining, and sentiment analysis. Text pre-processing is an important stage in text mining. The major obstacle in text mining is the very high dimensionality and the large size of text data. Natural language processing and morphological tools can be employed to reduce dimens...
متن کامل